Thus, as in law, we accept the null hypothesis unless we have enough evidence to reject it - similar to the principle that individuals are presumed innocent until proven guilty. If we reject the null hypothesis, we accept an alternative hypothesis. The alternative varies for different statistical tests, but in the case of the t-test we evaluate whether the two means are not equal:
Ha: μ1 ≠ μ2, where μ1 and μ2 are the true population means of the two groups
Remember that we can't know the true means of our populations of interest; instead, we approximate or estimate the means from the samples in our data.
We are interested in comparing the means of female rural and urban BMI measurements for both years.
There are two possible classes of statistical tests that we could run to compare the means of these two groups:
Parametric tests are based on assumptions about the distribution of the data, while nonparametric tests do not rely on such assumptions. They are called "parametric" because aspects of a population's distribution, like the mean, are called parameters. In parametric tests we estimate the parameters of the true population of interest using a sample of that population; these estimates are called statistics. See here for more information about the difference between these two classes of tests.
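To make the parameter/statistic distinction concrete, here is a small simulation sketch (not part of the BMI analysis, and the BMI-like values are made up): the mean of a simulated "population" plays the role of the parameter, and the mean of a sample drawn from it is the statistic that estimates it.

```r
# A simulated "population" whose true mean (a parameter) we normally can't observe
set.seed(42)
population <- rnorm(1e6, mean = 25, sd = 4)  # hypothetical BMI-like values

# In practice we only see a sample; its mean is a statistic
bmi_sample <- sample(population, size = 200)

parameter <- mean(population)   # the population parameter (close to 25)
statistic <- mean(bmi_sample)   # our estimate of that parameter

c(parameter = parameter, statistic = statistic)
```

With n = 200 the statistic lands close to the parameter, which is exactly what parametric tests rely on.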
Parametric two sample mean tests
Often when comparing two groups we might perform a two sample t-test to determine if the means of the groups are different. The two sample t-test, however, relies on several assumptions:
- The data for both groups is normally distributed
- The variance of both groups is similar
- The number of observations is similar for both groups - thus they are balanced
- The observations are independent (meaning that observations do not influence each other)
If these assumptions are violated, this doesn’t necessarily mean we can’t perform a t-test. It just means we may need to consider the following options:
- Transformation of the data to make it more normally distributed
- Welch’s t-test (also called the unequal variance t-test), which modifies the way we perform the t-test to account for the difference in the variance of the two groups
- Permutation/resampling methods to deal with violations of normality or imbalance.
Alternatively, we can use a nonparametric test like the Wilcoxon–Mann–Whitney (WMW) test. These tests are often a good option when multiple assumptions are violated or when sample sizes are small. We will explore these options.
Our data has a balanced number of observations across groups - in fact they are equal - thus that assumption is not violated. If it were violated, we would want to consider permutation methods, which are also a good option for violations of normality. To learn more about these methods see here.
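As an illustration of the permutation approach mentioned above (a sketch with simulated data, not the BMI measurements): we shuffle the group labels many times and ask how often a shuffled difference in means is at least as extreme as the observed one. No normality or balance assumptions are needed.

```r
# Permutation test for a difference in means
set.seed(1)
perm_test <- function(x, y, n_perm = 2000) {
  obs <- mean(x) - mean(y)           # observed difference in means
  pooled <- c(x, y)
  perm_diffs <- replicate(n_perm, {
    shuffled <- sample(pooled)       # shuffle the group labels
    mean(shuffled[seq_along(x)]) - mean(shuffled[-seq_along(x)])
  })
  # proportion of shuffles at least as extreme as the observed difference
  mean(abs(perm_diffs) >= abs(obs))
}

group_a <- rexp(60, rate = 1)        # skewed, non-normal data
group_b <- rexp(60, rate = 1) + 2    # same shape, shifted mean
perm_test(group_a, group_b)          # small p-value: the means differ
```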
If we needed to check if our samples were imbalanced, we could use the table() function:
, , = 1985
National Rural Urban
Men 200 200 200
Women 200 200 200
, , = 2017
National Rural Urban
Men 200 200 200
Women 200 200 200
We can see that the number of observations for each possible group of interest is the same.
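The counts above were likely produced by a three-way call such as table(BMI_long$Sex, BMI_long$Region, BMI_long$Year) - the exact column names are assumptions. A self-contained sketch of the same idea with toy data:

```r
# Toy data with the same structure: Sex x Region x Year, 3 "countries" each
toy <- expand.grid(Sex     = c("Men", "Women"),
                   Region  = c("National", "Rural", "Urban"),
                   Country = paste0("country_", 1:3),
                   Year    = c("1985", "2017"))

# A three-way contingency table: one Sex x Region sub-table of counts per Year
table(toy$Sex, toy$Region, toy$Year)
```

Each cell counts the observations in that Sex/Region/Year combination, which is exactly what we need to check for balance.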
The t-test is also fairly robust to non-normality if the sample size is relatively large, due to what is called the central limit theorem.
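A quick simulation sketch of why this works: even when the individual observations are strongly skewed, the distribution of sample means at n = 200 is close to normal.

```r
# Central limit theorem in action: means of skewed (exponential) samples
set.seed(7)
sample_means <- replicate(5000, mean(rexp(200, rate = 1)))

mean(sample_means)   # close to the true mean of 1
sd(sample_means)     # close to the theoretical 1 / sqrt(200), about 0.071
hist(sample_means)   # roughly bell-shaped despite the skewed raw data
```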
We have an n of 200, which should be sufficient, but let’s investigate the nonparametric tests further.
Often we would check whether the variance of the rural and urban data is equal using the var.test() function. However, this is an F test and assumes that the data is normally distributed. Instead we will use the mood.test() function, which performs Mood’s two-sample test for a difference in scale parameters and does not assume normality. We will also introduce the pull() function from the dplyr package.
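The first output below was produced by a call of the form mood.test(pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI), pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)). Since the BMI data isn't loaded here, a base-R sketch with simulated data shows the mechanics:

```r
# Mood's two-sample test of scale: do two samples differ in spread?
set.seed(3)
narrow <- rnorm(200, mean = 25, sd = 1)  # smaller variance
wide   <- rnorm(200, mean = 25, sd = 3)  # larger variance

res <- mood.test(narrow, wide)
res$p.value   # small p-value: evidence that the scales (variances) differ
```

Like the output below, the result reports a Z statistic and a two-sided p-value for the null hypothesis of equal scale.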
Mood two-sample test of scale
data: dplyr::pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and dplyr::pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
Z = 2.9189, p-value = 0.003513
alternative hypothesis: two.sided
# p-value < .05: conclude that the variances are not equal
# reject the null hypothesis of no difference in the variance of the distributions
mood.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long,
Sex == "Women",
Year == "1985",
Region == "Urban"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI)
Z = 3.1305, p-value = 0.001745
alternative hypothesis: two.sided
# p-value < .05: conclude that the variances are not equal
# reject the null hypothesis of no difference in the variance of the distributions
mood.test(pull(filter(BMI_long,
Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI)
Z = -0.24228, p-value = 0.8086
alternative hypothesis: two.sided
# p-value > .05: conclude that the variances are equal
# fail to reject the null hypothesis of no difference in the variance of the distributions
mood.test(pull(filter(BMI_long,
Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
Z = 1.5317, p-value = 0.1256
alternative hypothesis: two.sided
Our p-value is less than .05 for both rural versus urban comparisons, so we reject the null hypothesis of no difference in variance and conclude that the variance is not equal: our data violates this assumption. For the 1985 versus 2017 comparisons within each region, the p-values are greater than .05, so the variances appear equal there.
We will perform Welch’s t-test, which accounts for the fact that our variance is not equal. (Note that when paired = TRUE, t.test() operates on the within-pair differences, so the var.equal argument has no effect on a paired test.)
Another important consideration is that the data is what we call paired, meaning that the measurements from the rural and urban areas are not independent: we have a rural and an urban mean for nearly every country, so these values may be more similar to one another when they come from the same country. The same is true for the male and female measurements from the same country, and for the 1985 and 2017 values from the same country. However, we assume that measurements from different countries are independent, so that assumption is not violated, making it reasonable to perform a Welch’s or paired t-test.
When we perform a paired t-test our hypothesis is slightly different from the typical Student’s t-test. In this case we are testing the differences among the pairs of observations and how close these differences are to zero. Our null hypothesis is that the mean of the differences is equal to zero:
Ho: μd = 0
where μd is the true mean difference
between paired observations of the two groups
In this case the alternative hypothesis is that the mean of the differences is not equal to zero:
Ha: μd ≠ 0
where μd is the true mean difference
between paired observations of the two groups
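One way to see what the paired test is doing (a sketch with simulated paired data, not the BMI measurements): t.test(x, y, paired = TRUE) is identical to a one-sample t-test on the within-pair differences.

```r
# Simulated paired measurements: each "country" contributes to both groups
set.seed(11)
country_effect <- rnorm(100, mean = 25, sd = 2)     # shared per-pair baseline
rural <- country_effect + rnorm(100, sd = 0.5)
urban <- country_effect + 1 + rnorm(100, sd = 0.5)  # urban about 1 unit higher

paired_res   <- t.test(rural, urban, paired = TRUE)
one_samp_res <- t.test(rural - urban, mu = 0)       # same test, by hand

c(paired = paired_res$statistic, one_sample = one_samp_res$statistic)
```

The two t statistics are identical, which is why pairing removes the between-country variation from the comparison.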
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
t = -10.356, df = 194, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0573625 -0.7190478
sample estimates:
mean of the differences
-0.8882051
# means differ: p-value < .05, reject the null hypothesis of no difference in the means
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI)
t = -14.095, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.1870263 -0.8956268
sample estimates:
mean of the differences
-1.041327
# means differ: p-value < .05, reject the null hypothesis of no difference in the means
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI)
t = -22.119, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.591762 -2.167422
sample estimates:
mean of the differences
-2.379592
# means differ: p-value < .05, reject the null hypothesis of no difference in the means
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
t = -24.378, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.383938 -2.027118
sample estimates:
mean of the differences
-2.205528
Question opportunity: Looking at the t value, was global BMI lower in Rural or Urban areas in 1985?
Now we will try to transform our data to make it more normally distributed. One way to do this is to take the logarithm of the data values; then we will see how this influences the results. Again we will focus on the data for women.
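The transformed data used below (BMI_long_log with a log_BMI column) was presumably created with something like mutate(log_BMI = log(BMI)); those names are assumptions. A self-contained sketch of how a log transform can improve normality, checked with shapiro.test():

```r
# Right-skewed data fails a normality test; its log looks far more normal
set.seed(5)
skewed <- rlnorm(200, meanlog = 3.2, sdlog = 0.5)  # simulated lognormal values

p_raw <- shapiro.test(skewed)$p.value        # very small: clearly non-normal
p_log <- shapiro.test(log(skewed))$p.value   # much larger: log scale is normal here

c(raw = p_raw, log = p_log)
```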

# A tibble: 6 x 3
# Groups: Year [2]
Year Region shapiro_test
<chr> <chr> <dbl>
1 1985 National 0.00315
2 1985 Rural 0.0105
3 1985 Urban 0.000334
4 2017 National 0.293
5 2017 Rural 0.0784
6 2017 Urban 0.00416
# A tibble: 6 x 3
# Groups: Year [2]
Year Region shapiro_test
<chr> <chr> <dbl>
1 1985 National 0.0000478
2 1985 Rural 0.000679
3 1985 Urban 0.000000332
4 2017 National 0.00363
5 2017 Rural 0.0108
6 2017 Urban 0.0000130
The data appears closer to the normal distribution, although not quite there. Again, our sample size of 200 is quite large and the t-test is generally quite robust to violations of normality with large n, thus the modified t-test accounting for unequal variance might be a good option using the log-transformed data, as it is at least more normally distributed.
Let’s see the results of the t-test with the transformed data:
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Urban"), log_BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Rural"), log_BMI) and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Urban"), log_BMI)
t = -10.058, df = 194, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.04242774 -0.02851589
sample estimates:
mean of the differences
-0.03547182
# means differ: p-value < .05, reject the null hypothesis of no difference in the means
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Urban"), log_BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Rural"), log_BMI) and pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Urban"), log_BMI)
t = -13.962, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.05214677 -0.03923811
sample estimates:
mean of the differences
-0.04569244
# means differ: p-value < .05, reject the null hypothesis of no difference in the means
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Rural"), log_BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Rural"), log_BMI) and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Rural"), log_BMI)
t = -22.369, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.10617051 -0.08896626
sample estimates:
mean of the differences
-0.09756839
# means differ: p-value < .05, reject the null hypothesis of no difference in the means
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Urban"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Urban"), log_BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Urban"), log_BMI) and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Urban"), log_BMI)
t = -23.977, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09377834 -0.07952498
sample estimates:
mean of the differences
-0.08665166
We can see that our results are quite similar to those from the original data, although the t-values are slightly smaller. In other cases, transforming the data may have a much more dramatic influence.
Now, let’s take a look at nonparametric tests, which are also a great option when the assumptions of the t-test are violated.
Nonparametric two sample tests
There are two nonparametric options to consider when the assumptions of the t-test are violated: the Wilcoxon signed rank test (for paired data; for independent samples the alternative is the Wilcoxon rank sum test, also called the Mann-Whitney U test) and the two-sample Kolmogorov-Smirnov test. Neither test assumes normality. These tests should be considered when the data in either group does not appear to be normally distributed, and particularly when the number of samples is low.
Importantly, the two-sample Kolmogorov-Smirnov (KS) test does not assume normality or equal variance, while the Wilcoxon signed rank test does assume equal variance. In our case, because the variance is not equal between some of our groups of interest, the KS test is more appropriate for those comparisons. Both the t-test and the KS test evaluate whether the distributions of the two groups are identical, but the KS test does not target any particular aspect of the distribution, like the mean, so there are no confidence intervals in its output. (Note also that ks.test() has no paired option; it always treats the two samples as independent.) Here is how you would perform these tests.
ks.test(pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI))
Two-sample Kolmogorov-Smirnov test
data: pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
D = 0.20006, p-value = 0.0007385
alternative hypothesis: two-sided
ks.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI))
Two-sample Kolmogorov-Smirnov test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI)
D = 0.19914, p-value = 0.0007779
alternative hypothesis: two-sided
What about the difference in female BMI from 1985 to 2017 for both regions? Recall that the variance was equal for these comparisons.
wilcox.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI)
V = 273, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
wilcox.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI),
paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
V = 189, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
There is a significant difference across time for both regions, as we saw with the t-test, and a significant difference by region within each year. However, the p-values from the KS test are somewhat larger than those we saw with the t-test.